Abstract:

Newly emerging robotics applications for domestic or entertainment purposes
are slowly introducing autonomous robots into society at large. A critical
capability of such robots is their ability to interact with humans, and in
particular, untrained users. This paper explores the hypothesis that people
will intuitively interact with robots in a natural social manner provided the
robot can perceive, interpret, and appropriately respond with familiar human
social cues. Two experiments are presented where naive human subjects interact
with an anthropomorphic robot. Evidence for mutual regulation and entrainment
of the interaction is presented, and how this benefits the interaction as a
whole is discussed.


1.  Introduction 

New  applications  for  domestic,  health  care  related,  or  entertainment  based 
robots  motivate  the  development  of  robots  that  can  socially  interact  with,  learn 
from,  and  cooperate  with  people.  One  could  argue  that  because  humanoid 
robots  share  a  similar  morphology  with  humans,  they  are  well  suited  for  these 
purposes  -  capable  of  receiving,  interpreting,  and  reciprocating  familiar  social 
cues  in  the  natural  communication  modalities  of  humans. 

However,  is  this  the  case?  Although  we  can  design  robots  capable  of 
interacting  with  people  through  facial  expression,  body  posture,  gesture,  gaze 
direction,  and  voice,  the  robotic  analogs  of  these  human  capabilities  are  a  crude 
approximation  at  best  given  limitations  in  sensory,  motor,  and  computational 
resources.  Will  humans  readily  read,  interpret,  and  respond  to  these  cues  in 
an  intuitive  and  beneficial  way? 

Research in related fields suggests that this is the case for computers [1]
and animated conversation agents [2]. The purpose of this paper is to explore
this hypothesis in a robotic medium. Several expressive face robots have been
implemented in Japan, where the focus has been on mechanical engineering
design, visual perception, and control. For instance, the robot in the far
left picture of figure 1 resembles a young Japanese woman (complete with
silicone gel skin, teeth, and hair) [5]. The robot’s degrees of freedom mirror
those of a human face, and novel actuators have been designed to accomplish
this in the desired form factor. It can recognize six human facial expressions
and can mimic them back to the person who displays them. In contrast, the
robot shown in the middle left picture of figure 1 resembles a mechanical
cartoon [6]. The robot gives expressive responses to the proximity and
intensity of a light source (such as withdrawing and narrowing its eyelids
when the light is too bright). It also responds expressively to a limited
number of scents (such as looking drunk when smelling alcohol, and looking
annoyed when smoke is blown in its face). The middle right picture of figure 1
shows an upper-torso humanoid robot (with an expressionless face) that can
direct its gaze to look at the appropriate person during a conversation by
using sound localization and head pose of the speaker [7].

Figure 1. A sampling of robots designed to interact with people. The far left
picture shows a realistic face robot designed at the Science University of
Tokyo. The middle left picture shows WE-3RII, an expressive face robot
developed at Waseda University. The middle right picture shows Robita, an
upper-torso robot also developed at Waseda University to track speaking turns.
The far right picture shows our expressive robot, Kismet, developed at MIT.
The two leftmost photos are courtesy of Peter Menzel [8].

In contrast, the focus of our research has been to explore dynamic,
expressive, pre-linguistic, and relatively unconstrained face to face social
interaction between a human and an anthropomorphic robot called Kismet (see
the far right picture of figure 1). For the past few years, we have been
investigating this question in a variety of domains through an assortment of
experiments where naive human subjects interact with the robot. This paper
summarizes our results with respect to two areas of study: the communication
of affective intent and the dynamics of proto-dialog between human and robot.
In each case we have adapted the theory underlying these human competencies to
Kismet, and have experimentally studied how people consequently interact with
the robot. Our data suggest that naive subjects naturally and intuitively read
the robot’s social cues and readily incorporate them into the exchange in
interesting and beneficial ways. We discuss evidence of communicative efficacy
and entrainment that results in an overall improved quality of interaction.

2.  Communication  of  Affective  Intent 

Human speech provides a natural and intuitive interface both for communicating
with humanoid robots and for teaching them. Towards this goal, we have
explored the question of recognizing affective communicative intent in
robot-directed speech. Developmental psycholinguists can tell us quite a lot
about how preverbal infants achieve this, and how caregivers exploit it to
regulate the infant’s behavior. Infant-directed speech is typically quite
exaggerated in pitch and intensity (often called motherese). Moreover, mothers
intuitively use selective prosodic contours to express different communicative
intentions. Based on a series of cross-linguistic analyses, there appear to be
at least four different pitch contours (approval, prohibition, comfort, and
attentional bids), each associated with a different emotional state [9].
Figure 2 illustrates these four prosodic contours.

Figure 2. Fernald’s prototypical prosodic contours for approval, attentional
bid, prohibition, and soothing.


Mothers are more likely to use falling pitch contours than rising pitch
contours when soothing a distressed infant [10], to use rising contours to
elicit attention and to encourage a response [11], and to use bell shaped
contours to maintain attention once it has been established [12]. Expressions
of approval or praise, such as “Good girl!” are often spoken with an
exaggerated rise-fall pitch contour with sustained intensity at the contour’s
peak. Expressions of prohibitions or warnings such as “Don’t do that!” are
spoken with low pitch and high intensity in staccato pitch contours. Fernald
suggests that the pitch contours observed have been designed to directly
influence the infant’s emotive state, causing the child to relax or become
more vigilant in certain situations, and to either avoid or approach objects
that may be unfamiliar [9].

Inspired by these theories, we have implemented a recognizer for
distinguishing the four distinct prosodic patterns that communicate praise,
prohibition, attention, and comfort to preverbal infants from neutral speech.
We have integrated this perceptual ability into our robot’s emotion system,
thereby allowing a human to directly manipulate the robot’s affective state,
which is in turn reflected in the robot’s expression.

2.1.  The  Classifier  Implementation 

As shown in figure 3, the affective speech recognizer receives robot-directed
speech as input. The speech signal is analyzed by the low-level speech
processing system, producing time-stamped pitch (Hz), percent periodicity (a
measure of how likely a frame is a voiced segment), energy (dB), and phoneme
values¹ in real-time. The next module performs filtering and pre-processing to
reduce the amount of noise in the data. The pitch value of a frame is simply
set to 0 if the corresponding percent periodicity indicates that the frame is
more likely to correspond to unvoiced speech. The resulting pitch and energy
data are then passed through the feature extractor, which calculates a set of
selected features (F1 to Fn). Finally, based on the trained model, the
classifier determines whether the computed features are derived from an
approval, an attentional bid, a prohibition, soothing speech, or a neutral
utterance.

¹ This auditory processing code is provided by the Spoken Language Systems
Group at MIT. For now, the phoneme information is not used in the recognizer.

Figure 3. The spoken affective intent recognizer.


2.1.1.  Training  the  System 

Two female adults who frequently interact with Kismet as caregivers were
recorded. The speakers were asked to express all five affective intents
(approval, attentional bid, prohibition, comfort, and neutral) during the
interaction. Recordings were made using a wireless microphone, and the output
signal was sent to the low-level speech processing system running on Linux.
For each utterance, this phase produced a 16-bit single channel, 8 kHz signal
(in a .wav format) as well as its corresponding real-time pitch, percent
periodicity, energy, and phoneme values. All recordings were performed in
Kismet’s usual environment to minimize variability of environment-specific
noise. Samples containing extremely loud noises (door slams, etc.) were
eliminated, and the remaining data set was labeled according to the speakers’
affective intents during the interaction. There were a total of 726 utterances
in the final data set (approximately 145 utterances per class).

2.1.2.  Data  Preprocessing 

The pitch value of a frame was set to 0 if the corresponding percent
periodicity was lower than a threshold value. This indicates that the frame is
more likely to correspond to unvoiced speech. Even after this procedure,
observation of the resulting pitch contours still indicated the presence of
substantial noise. Specifically, a significant number of errors were
discovered in the high pitch value region (above 500 Hz). Therefore,
additional preprocessing was performed on all pitch data. For each pitch
contour, a histogram of ten regions was constructed. Using the heuristic that
the pitch contour was relatively smooth, it was determined that if only a few
pitch values were located in the high region while the rest were much lower
(and none resided in between), then the high values were likely to be noise.
Note that this process did not eliminate high but smooth pitch contours, since
their pitch values would be distributed evenly across nearby regions.

Figure 4. Fernald’s prototypical prosodic contours found in the preprocessed
data set. Notice the similarity to those shown in figure 2.
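
The two preprocessing steps described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the original implementation: the periodicity threshold, the number of values that counts as "only a few" (max_outliers), and the function names are all hypothetical, and the ten histogram regions are assumed to span 0 Hz to the contour's maximum pitch.

```python
def zero_unvoiced(pitch, periodicity, threshold=0.5):
    """Set pitch to 0 where percent periodicity falls below the threshold.

    threshold=0.5 is an assumed value; the text does not give one.
    """
    return [p if per >= threshold else 0.0 for p, per in zip(pitch, periodicity)]


def remove_isolated_high_values(pitch, n_bins=10, max_outliers=3):
    """Zero a few isolated high pitch values using a histogram of n_bins regions.

    max_outliers (how many values count as "only a few") is an assumption.
    """
    voiced = [p for p in pitch if p > 0]
    if not voiced:
        return list(pitch)
    width = max(voiced) / n_bins

    def bin_of(p):
        # Clamp so that the maximum value lands in the top region.
        return min(int(p / width), n_bins - 1)

    counts = [0] * n_bins
    for p in voiced:
        counts[bin_of(p)] += 1

    high_count = 0
    # Walk down from the top region until the first empty region (a gap).
    for b in range(n_bins - 1, 0, -1):
        if counts[b] == 0:
            below = sum(counts[:b])
            if high_count <= max_outliers and below > high_count:
                # Few values above the gap, most below it: treat the high ones as noise.
                return [0.0 if p > 0 and bin_of(p) > b else p for p in pitch]
            return list(pitch)
        high_count += counts[b]
    # No gap found: the contour is smooth, so nothing is removed.
    return list(pitch)
```

A contour that rises smoothly to a high pitch occupies adjacent regions without a gap, so it passes through unchanged, which matches the note above about high but smooth contours.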


2.1.3.  Classification  Method 


In all training phases, each class of data was modeled using a Gaussian
mixture model, updated with the EM algorithm and a kurtosis-based approach for
dynamically deciding the appropriate number of kernels [13]. Due to the
limited set of training data, cross-validation was performed in all
classification processes. Specifically, a subset of the data was set aside as
a test set, and a classifier was trained using the remaining data. The
classifier’s performance was then tested on the held-out test set. This
process was repeated 100 times per classifier. The mean and variance of the
percentage of correctly classified test data were calculated to estimate the
classifier’s performance.
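
The evaluation protocol above (repeated random hold-out, reporting the mean and variance of the percentage correct) can be sketched as below. The held-out fraction and random seed are assumptions, since the text does not state them. The nearest-centroid classifier is a deliberately simple stand-in used only to make the example runnable; it is not the Gaussian mixture model described above.

```python
import random
from statistics import mean, pvariance


def repeated_holdout(samples, labels, train_fn, predict_fn,
                     n_repeats=100, test_fraction=0.2, seed=0):
    """Return (mean, variance) of percent correct over repeated random splits.

    test_fraction and seed are assumed values for illustration.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, int(len(samples) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(100.0 * correct / n_test)
    return mean(scores), pvariance(scores)


def train_centroids(xs, ys):
    """Stand-in model: one mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))
```

Any model with the same train/predict interface, including a per-class mixture model, could be passed in place of the stand-in.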


Feature   Description
F1        Pitch mean
F2        Pitch variance
F3        Maximum pitch
F4        Minimum pitch
F5        Pitch range
F6        Delta pitch mean
F7        Absolute delta pitch mean
F8        Energy mean
F9        Energy variance
F10       Energy range
F11       Maximum energy
F12       Minimum energy

Table 1. Features extracted in the first-stage classifier. These features are
measured over the non-zero values throughout the entire utterance. Feature
F6 measures the steepness of the slope of the pitch contour.
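
As an illustration of how the global features in table 1 might be computed for one utterance, the sketch below takes per-frame pitch and energy sequences. Several details are assumptions rather than facts from the text: the function name, the use of population variance, computing the delta features between consecutive non-zero pitch values, and applying the non-zero filter to energy as well as pitch.

```python
from statistics import mean, pvariance


def global_features(pitch, energy):
    """Compute F1..F12 over the non-zero frames of one utterance."""
    p = [v for v in pitch if v != 0]
    e = [v for v in energy if v != 0]
    if len(p) < 2 or not e:
        raise ValueError("need at least two voiced frames and one energy frame")
    # Frame-to-frame pitch change between consecutive non-zero values (assumption).
    deltas = [b - a for a, b in zip(p, p[1:])]
    return {
        "F1": mean(p),                          # pitch mean
        "F2": pvariance(p),                     # pitch variance
        "F3": max(p),                           # maximum pitch
        "F4": min(p),                           # minimum pitch
        "F5": max(p) - min(p),                  # pitch range
        "F6": mean(deltas),                     # delta pitch mean
        "F7": mean([abs(d) for d in deltas]),   # absolute delta pitch mean
        "F8": mean(e),                          # energy mean
        "F9": pvariance(e),                     # energy variance
        "F10": max(e) - min(e),                 # energy range
        "F11": max(e),                          # maximum energy
        "F12": min(e),                          # minimum energy
    }
```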

2.1.4. Feature Selection

As shown in figure 4, the preprocessed pitch contours in the labeled data
resemble Fernald’s prototypical prosodic contours for approval, attention,
prohibition, and comfort/soothing. A set of global pitch and energy related
features (see table 1) was used to recognize these proposed patterns. All
pitch features were measured using only non-zero pitch values. Using this
feature set, a sequential forward feature selection process was applied to
construct an optimal classifier. Each possible feature pair’s classification
performance was measured and sorted from highest to lowest. Successively, a
feature pair from the sorted list was added into the selected feature set to
determine the best n features for an optimal classifier. Table 2 shows the
results of the classifiers constructed using the best seven feature pairs.
Classification performance increases as more features are added, reaches a
maximum (78.77 percent) with five features in the set, and levels off above 60
percent with six or more features. It was found that global pitch and energy
measures were useful in roughly separating the proposed patterns based on
arousal (largely distinguished by energy measures) and valence (largely
distinguished by pitch measures). However, further processing was required to
distinguish each of the five classes distinctly.
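
The pair-ranking and set-growing procedure described above can be sketched as follows. The evaluate callback is hypothetical: it is assumed to train and cross-validate a classifier on a given set of features and return its mean performance. The function names and the max_pairs default are likewise illustrative.

```python
from itertools import combinations


def rank_pairs(features, evaluate):
    """Score every feature pair and sort from best to worst."""
    pairs = list(combinations(features, 2))
    return sorted(pairs, key=lambda pair: evaluate(set(pair)), reverse=True)


def grow_feature_set(ranked_pairs, evaluate, max_pairs=7):
    """Add ranked pairs one at a time and record each growing set's score."""
    selected, history = [], []
    for pair in ranked_pairs[:max_pairs]:
        for feature in pair:
            if feature not in selected:
                selected.append(feature)
        history.append((tuple(selected), evaluate(set(selected))))
    # Keep the set with the highest score seen along the way.
    best_set, best_score = max(history, key=lambda item: item[1])
    return best_set, best_score, history
```

Because the score is recorded after every addition, a peak followed by a decline (as reported for table 2) shows up directly in the returned history.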

Accordingly, the classifier consists of several mini-classifiers executing in
stages. In the beginning stages, the classifier uses global pitch and energy
features to separate some of the classes into pairs (in this case, clusters of
soothing along with low-energy neutral, prohibition along with high-energy
neutral, and attention along with approval were formed). These clustered
classes were then passed to additional classification stages for further
refinement. New features had to be considered to build these additional
classifiers. Using prior information, a new set of features encoding the shape
of the pitch contour was included, which proved useful in further separating
the classes.


Feature   Feature set                 Perf.   Perf.   % error   % error    % error      % error   % error
pair                                  mean    var.    approval  attention  prohibition  soothing  neutral
F1, F9    F1 F9                       72.09   0.08    48.67     24.45      8.70         15.58     42.13
F1, F10   F1 F9 F10                   75.17   0.12    41.67     25.67      9.65         13.15     33.98
F1, F11   F1 F9 F10 F11               78.13   0.08    29.85     27.20      8.80         10.63     32.90
F2, F9    F1 F2 F9 F10 F11            78.77   0.11    29.15     22.23      8.53         12.55     33.68
F3, F9    F1 F2 F3 F9 F10 F11         61.52   1.10    63.87     43.03      9.08         23.05     53.35
F1, F8    F1 F2 F3 F8 F9 F10 F11      62.27   1.81    60.58     39.60      16.40        24.18     47.90
F5, F9    F1 F2 F3 F5 F8 F9 F10 F11   65.93   0.72    57.03     32.15      12.13        19.73     49.35

Table 2. The performance (the percent correctly classified) is shown for the
best pair-wise set having up to eight features. The pair-wise performance was
ranked for the best seven pairs. As each successive feature was added,
performance peaked with five features (78.8%), but then dropped off.

Figure 5. Feature space of all five classes with respect to energy variance,
F9, and pitch mean, F1. There are three distinguishable clusters: prohibition,
soothing and neutral, and approval and attention.


To select the best features for the initial classification stage, the seven
feature pairs listed in table 2 were examined. All feature pairs worked better
in separating prohibition and soothing than other classes. The F1-F9 pair
generates the highest overall performance and the fewest errors in classifying
prohibition. Several observations can be made from the feature
space  of  this  classifier  (see  figure  5).  The  prohibition  samples  are  clustered 
in  the  low  pitch  mean  and  high  energy  variance  region.  The  approval  and 
attention  classes  form  a  cluster  at  the  high  pitch  mean  and  high  energy  variance 
region.  The  soothing  samples  are  clustered  in  the  low  pitch  mean  and  low 
energy  variance  region.  The  neutral  samples  have  low  pitch  mean  and  are 
divided  into  two  regions  in  terms  of  their  energy  variance  values.  The  neutral 
samples  with  high  energy  variance  are  clustered  separately  from  the  rest  of  the 
classes  (in  between  prohibition  and  soothing),  while  the  ones  with  lower  energy 
variance  are  clustered  within  the  soothing  class.  These  findings  are  consistent 
with  the  proposed  prior  knowledge.  Approval,  attention,  and  prohibition  are 
associated  with  high  intensity  while  soothing  exhibits  much  lower  intensity. 
Neutral  samples  span  from  low  to  medium  intensity,  which  makes  sense  because 
the  neutral  class  includes  a  wide  variety  of  utterances. 

Based on this observation, the first classification stage uses energy-related
features to separate soothing and low-intensity neutral from the other higher
intensity classes (see figure 6). In the second stage, if the utterance had a
low intensity level, another classifier decides whether it is soothing or
neutral. If the utterance exhibited high intensity, the F1-F9 pair is used to
classify among prohibition, the approval-attention cluster, and high intensity
neutral. An additional stage is required to classify between approval and
attention if the utterance happened to fall within the approval-attention
cluster.


Figure 6. The classification stages of the multi-stage classifier.
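
The routing between stages can be summarized in a few lines. In this sketch the four stage classifiers are passed in as callables; their names, signatures, and return labels are hypothetical and stand in for the trained mini-classifiers.

```python
def classify_multistage(features, stage1, stage2a, stage2b, stage3):
    """Route one utterance's features through the staged classifiers.

    stage1  -> "low" or "high" intensity
    stage2a -> "soothing" or "neutral"                 (low-intensity branch)
    stage2b -> "prohibition", "neutral", or "approval-attention"
    stage3  -> "approval" or "attention"
    """
    if stage1(features) == "low":
        return stage2a(features)
    label = stage2b(features)
    if label == "approval-attention":
        # Only the merged cluster needs the extra refinement stage.
        return stage3(features)
    return label
```

One consequence of this layout is that an error in an early stage cannot be corrected later, which is why stage 1 accuracy matters most.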

2.1.5. Stage 1: Soothing and Low-Intensity Neutral versus Everything Else

The first two columns in table 3 show the classification performance of the
top four feature pairs (sorted according to how well each pair classifies
soothing and low-intensity neutral against other classes). The last two
columns illustrate the classification results as each pair is added
sequentially into the feature set. The final classifier was constructed using
the best feature set (energy variance, maximum energy, and energy range), with
an average performance of 93.6 percent.

Feature Pair   Pair Perf. Mean (%)   Feature Set         Perf. Mean (%)
F9, F11        93.0                  F9 F11              93.0
F10, F11       91.8                  F9 F10 F11          93.6
F2, F9         91.7                  F2 F9 F10 F11       93.3
F7, F9         91.3                  F2 F7 F9 F10 F11    91.6

Table 3. Classification results in stage 1.

2.1.6. Stage 2A: Soothing versus Low-Intensity Neutral

Since the global pitch and energy features were not sufficient to separate
these two classes, new features were introduced into the classifier. Fernald’s
prototypical prosodic patterns for soothing suggest looking for a smooth pitch
contour exhibiting a frequency down-sweep. Visual observations of the neutral
samples in the data set indicated that neutral speech generated flatter and
choppier pitch contours as well as less-modulated energy contours. Based on
these postulations, a classifier using five features (number of pitch
segments, average length of pitch segments, minimum length of pitch segments,
slope of pitch contour, and energy range) was constructed. The slope of the
pitch contour indicated whether the contour contained a down-sweep segment. It
was calculated by performing a linear fit on the contour segment starting at
the maximum peak. This classifier’s average performance is 80.3 percent.
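
The four pitch-shape features used in this stage can be sketched as below (the fifth feature, energy range, is the global feature from table 1). The interpretation here is an assumption: a pitch segment is taken to be a maximal run of non-zero pitch frames, and the slope is a least-squares line fitted from the global pitch maximum to the end of the segment that contains it. Function names are hypothetical.

```python
def pitch_segments(pitch):
    """Split a contour into maximal runs of non-zero (voiced) frames."""
    segments, current = [], []
    for value in pitch:
        if value > 0:
            current.append(value)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def least_squares_slope(ys):
    """Slope of the best-fit line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def stage2a_pitch_features(pitch):
    """Segment statistics plus the slope after the maximum pitch peak."""
    segments = pitch_segments(pitch)
    if not segments:
        raise ValueError("contour has no voiced frames")
    lengths = [len(s) for s in segments]
    peak_segment = max(segments, key=max)
    tail = peak_segment[peak_segment.index(max(peak_segment)):]
    return {
        "num_segments": len(segments),
        "avg_segment_length": sum(lengths) / len(lengths),
        "min_segment_length": min(lengths),
        "slope_after_peak": least_squares_slope(tail),  # negative for a down-sweep
    }
```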

2.1.7. Stage 2B: Approval-Attention versus Prohibition versus High-Intensity
Neutral

A combination of pitch mean and energy variance works well in this stage. The
resulting classifier’s average performance is 90.0 percent. Based on Fernald’s
prototypical prosodic patterns, it was speculated that pitch variance would
be a useful feature for distinguishing between prohibition and the
approval-attention cluster. Adding pitch variance into the feature set
increased the classifier’s average performance to 92.1 percent.

2.1.8.  Stage  3:  Approval  versus  Attention 

Since the approval class and attention class span the same region in the
global pitch versus energy feature space, prior knowledge (provided by
Fernald’s prototypical prosodic contours) gave the basis to introduce a new
feature. As mentioned above, approvals are characterized by an exaggerated
rise-fall pitch contour. This particular pitch pattern proved useful in
distinguishing between the two classes. First, a third-degree polynomial fit
was performed on each pitch segment. Each segment’s slope sequence was
analyzed for a positive slope followed by a negative slope with magnitudes
higher than a threshold value. The length of the longest pitch segment that
contributed to the rise-fall pattern (which was 0 if the pattern was
non-existent) was recorded. This feature, together with pitch variance, was
used in the final classifier and generated an average performance of 70.5
percent. Approval and attention are the most difficult to classify because
both classes exhibit high pitch and intensity. Although the shape of the pitch
contour helped to distinguish between the two classes, it is very difficult to
achieve high classification performance without looking at the linguistic
content of the utterance.

Table 4. Overall classification performance.

2.1.9.  Overall  Performance 

The  final  classifier  was  evaluated  using  a  new  test  set  generated  by  the  same 
female  speakers,  containing  371  utterances.  Because  each  mini-classifier  was 
trained  using  different  portions  of  the  original  database  (for  the  single-stage 
classifier),  a  new  data  set  was  gathered  to  ensure  that  no  mini-classifier  stage 
was  tested  on  data  used  to  train  it.  Table  4  shows  the  resulting  classification 
performance  and  compares  it  to  an  instance  of  the  cross-validation  results  of  the 
best  single-stage  five-way  classifier  obtained  using  the  five  features  described 
in section 2.1.4. Both classifiers perform very well on prohibition
utterances. The multi-stage classifier performs significantly better in
classifying the difficult classes, i.e., approval versus attention and
soothing versus neutral. This verifies that the features encoding the shape of
the pitch contours (derived from prior knowledge provided by Fernald’s
prototypical prosodic patterns) were very useful.

It is important to note that both classifiers produce acceptable failure
modes (i.e., strongly valenced intents are incorrectly classified as neutrally
valenced intents and not as oppositely valenced ones). All classes are sometimes
incorrectly  classified  as  neutral.  Approval  and  attentional  bids  are  generally 
classified  as  one  or  the  other.  Approval  utterances  are  occasionally  confused 
for  soothing  and  vice  versa.  Only  one  prohibition  utterance  was  incorrectly 
classified  as  an  attentional  bid,  which  is  acceptable.  The  single-stage  classifier 
made  one  unacceptable  error  of  confusing  a  neutral  utterance  as  a  prohibition. 
In  the  multi-stage  classifier,  some  neutral  utterances  are  classified  as  approval, 
attention,  and  soothing.  This  makes  sense  because  the  neutral  class  covers  a 
wide  variety  of  utterances. 


3.  Integration  with  the  Emotion  System 

The output of the recognizer is integrated into the rest of Kismet’s synthetic
nervous system (SNS) as shown in figure 7. The entry point for the
classifier’s result is at the auditory perceptual system. Here, it is fed into
an associated releaser process. In general, there are many different kinds of
releasers defined for Kismet, each combining different contributions from a
variety of perceptual and motivational systems. Here, I only discuss those
releasers related to the input from the vocal classifier. The output of each
vocal affect releaser represents its perceptual contribution to the rest of
the SNS. Each releaser combines the incoming recognizer signal with contextual
information (such as the current “emotional” state) and computes its level of
activation according to the magnitude of its inputs. If its activation passes
above threshold, it passes its output on to the emotion system.

Figure 7. System architecture for integrating vocal classifier input to
Kismet’s emotion system.


Within the emotion system, the output of each releaser must first pass
through the affective assessment subsystem in order to influence emotional
behavior. Within this assessment subsystem, each releaser is evaluated in
affective terms by an associated somatic marker (SM) process. This mechanism
is inspired by the Somatic Marker Hypothesis of [3], where incoming perceptual
information is “tagged” with affective information. Table 5 summarizes how
each vocal affect releaser is somatically tagged.

Category     Arousal      Valence          Stance    Typical Expression
Approval     medium high  high positive    approach  pleased
Prohibition  low          high negative    withdraw  sad
Comfort      low          medium positive  neutral   content
Attention    high         neutral          approach  interest
Neutral      neutral      neutral          neutral   calm

Table 5. Table mapping [A, V, S] to classified affective intents. Praise
biases the robot to be “happy,” prohibition biases it to be “sad,” comfort
evokes a “content, relaxed” state, and attention is “arousing”.

There are three classes of tags that the affective assessment phase uses
to affectively characterize its perceptual, motivational, and behavioral
input. Each tag has an associated intensity that scales its contribution to
the overall affective state. The arousal tag, A, specifies how arousing this
percept is to the emotional system. Positive values correspond to a high
arousal stimulus whereas negative values correspond to a low arousal stimulus.
The valence tag, V, specifies how good or bad this percept is to the emotional
system. Positive values correspond to a pleasant stimulus whereas negative
values correspond to an unpleasant stimulus. The stance tag, S, specifies how
approachable the percept is. Positive values correspond to advance whereas
negative values correspond to retreat. Because there are potentially many
different kinds of factors that modulate the robot’s affective state (e.g.,
behaviors, motivations, perceptions), this tagging process converts the myriad
of factors into a common currency that can be combined to determine the net
affective state.

For Kismet, the [A, V, S] trio is the currency the emotion system uses to
determine which emotional response should be active. This occurs in two
phases. First, all somatically marked inputs are passed to the emotion
elicitor stage. Each emotion process has an elicitor associated with it that
filters each of the incoming [A, V, S] contributions. Only those contributions
that satisfy the [A, V, S] criteria for that emotion process are allowed to
contribute to its activation. This filtering is done independently for each
class of affective tag. For instance, a valence contribution with a large
negative value will contribute not only to the sorrow emotion process, but to
the fear, anger, and distress processes as well. Given all these factors, each
elicitor computes its net [A, V, S] contribution and activation level, and
passes them to the associated emotion process within the emotion arbitration
subsystem. In the second phase, the emotion processes within the emotion
arbitration subsystem compete for activation based on their activation level.
There is an emotion process for each of Ekman’s six basic emotions [4]. Ekman
posits that these six emotions are innate in humans, and all others are
acquired through experience. The “Ekman six” encompass joy, anger, disgust,
fear, sorrow, and surprise.
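
A toy version of the two phases (elicitor filtering, then arbitration) is sketched below. Everything numeric here is an assumption made for illustration: the tag magnitudes only follow the signs given in table 5, the per-emotion sign criteria are hypothetical, only three of the six emotion processes are included, and summing accepted magnitudes is one plausible way to compute activation, not necessarily Kismet's.

```python
# (arousal, valence, stance) per vocal affect releaser. Signs follow table 5;
# the magnitudes are illustrative assumptions.
SOMATIC_TAGS = {
    "approval":    (0.5, 1.0, 1.0),
    "prohibition": (-0.5, -1.0, -1.0),
    "comfort":     (-0.5, 0.5, 0.0),
    "attention":   (1.0, 0.0, 1.0),
    "neutral":     (0.0, 0.0, 0.0),
}
DIMENSIONS = ("arousal", "valence", "stance")

# Hypothetical elicitor criteria: the sign each emotion process accepts
# for each class of tag it listens to.
CRITERIA = {
    "joy":    {"valence": +1},
    "sorrow": {"valence": -1, "arousal": -1},
    "anger":  {"valence": -1, "arousal": +1},
}


def elicitor_activation(emotion, contributions):
    """Sum the magnitudes of the tag values that satisfy this emotion's criteria."""
    activation = 0.0
    for tags in contributions:
        values = dict(zip(DIMENSIONS, tags))
        # Each class of tag is filtered independently of the others.
        for dimension, sign in CRITERIA[emotion].items():
            if values[dimension] * sign > 0:
                activation += abs(values[dimension])
    return activation


def arbitrate(contributions):
    """Return the winning emotion process and all activation levels."""
    activations = {e: elicitor_activation(e, contributions) for e in CRITERIA}
    winner = max(activations, key=activations.get)
    return winner, activations
```

With these toy values, a prohibition's negative valence feeds both the sorrow and anger elicitors, but its low arousal tips the competition toward sorrow, in line with the "sad" expression listed in table 5.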

If the activation level of the winning emotion process passes above threshold,
it is allowed to influence the behavior system and the motor expression
system.  There  are  actually  two  threshold  levels,  one  for  expression  and  one  for 
behavior.  The  expression  threshold  is  lower  than  the  behavior  threshold;  this 
allows  the  facial  expression  to  lead  the  behavioral  response.  This  enhances  the 
readability  and  interpretation  of  the  robot’s  behavior  for  the  human  observer. 
For  instance,  given  that  the  caregiver  makes  an  attentional  bid,  the  robot’s 
face  will  first  exhibit  an  aroused  and  interested  expression,  then  the  orienting 
response ensues. By staging the response in this manner, the caregiver gets
immediate expressive feedback that the robot understood her intent. For Kismet,
this  feedback  can  come  in  a  combination  of  facial  expression,  tone  of  voice,  or 
posture.  The  robot’s  facial  expression  also  sets  up  the  human’s  expectation  of 
what  behavior  will  soon  follow.  As  a  result,  the  human  observing  the  robot 
can  see  its  behavior,  in  addition  to  having  an  understanding  of  why  the  robot 
is  behaving  in  that  manner.  As  I  have  argued  previously,  readability  is  an 
important  issue  for  social  interaction  with  humans. 
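
The two-threshold staging described above can be sketched as follows. This is a minimal illustration, not Kismet's implementation: the numeric threshold values, the function name `arbitrate`, and the `ArbitrationResult` record are all hypothetical, and winner-take-all selection is assumed from the phrase "winning emotion process". The text states only that the expression threshold is lower than the behavior threshold.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical values; the paper says only that expression < behavior.
EXPRESSION_THRESHOLD = 0.3
BEHAVIOR_THRESHOLD = 0.6

EKMAN_SIX = ("joy", "anger", "disgust", "fear", "sorrow", "surprise")


@dataclass
class ArbitrationResult:
    winner: str              # most active emotion process
    activation: float        # its activation level
    drives_expression: bool  # may influence the motor expression system
    drives_behavior: bool    # may influence the behavior system


def arbitrate(activations: Mapping[str, float]) -> ArbitrationResult:
    """Select the most active emotion process and test both thresholds."""
    winner = max(EKMAN_SIX, key=lambda name: activations.get(name, 0.0))
    level = activations.get(winner, 0.0)
    return ArbitrationResult(
        winner=winner,
        activation=level,
        drives_expression=level > EXPRESSION_THRESHOLD,
        drives_behavior=level > BEHAVIOR_THRESHOLD,
    )


if __name__ == "__main__":
    # As activation rises, expression is enabled before behavior.
    for level in (0.2, 0.4, 0.7):
        print(level, arbitrate({"surprise": level}))
```

Because the expression threshold is crossed first as activation grows, the face changes before the behavior system responds, which is the lead the text attributes to improved readability.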

3.1.  Affective  Intent  Experiment 

Communicative  efficacy  has  been  tested  with  people  very  familiar  with  the 
robot  as  well  as  with  naive  subjects  in  multiple  languages  (French,  German, 
English,  Russian,  and  Indonesian).  Female  subjects  ranging  in  age  from  22  to 
54  were  asked  to  praise,  scold,  soothe,  and  to  get  the  robot’s  attention.  They 
were  also  asked  to  signal  when  they  felt  the  robot  “understood”  them.  All 
exchanges  were  video  recorded  for  later  analysis. 

Figure 8 illustrates a sample event sequence that occurred during the experiment
session of a naive speaker. Each row represents a trial in which
the subject attempts to communicate an affective intent to Kismet. For each
trial, we recorded the number of utterances spoken, Kismet’s cues, the subject’s
responses and comments, and any changes in prosody.

3.2.  Discussion 

Recorded  events  show  that  subjects  in  the  study  made  ready  use  of  Kismet’s 
expressive  feedback  to  assess  when  the  robot  “understood”  them.  The  robot’s 
expressive  repertoire  is  quite  rich,  including  both  facial  expressions  and  shifts  in 
body  posture.  The  subjects  varied  in  their  sensitivity  to  the  robot’s  expressive 
feedback,  but  all  used  facial  expression,  body  posture,  or  a  combination  of  both 
to  determine  when  the  utterance  had  been  properly  communicated  to  the  robot. 
All  subjects  would  reiterate  their  vocalizations  with  variations  about  a  theme 
until  they  observed  the  appropriate  change  in  facial  expression.  If  the  wrong 
facial  expression  appeared,  they  often  used  strongly  exaggerated  prosody  to 
“correct” the “misunderstanding”. In trials 20-22 of subject S3’s experiment
session, she giggled when Kismet smiled despite her scolding, commented that
volume would help, and thus spoke louder in the next trial. In general, the
subjects used Kismet’s expressive feedback to regulate their own behavior.

Figure 8. Sample experiment session of a naive speaker, S3.

Kismet’s  expression  through  face  and  body  posture  becomes  more  intense 
as  the  activation  level  of  the  corresponding  emotion  process  increases.  For 
instance, small smiles versus large grins were often used to discern how “happy”
the robot appeared. Small ear perks versus widened eyes with elevated ears and
craning  the  neck  forward  were  often  used  to  discern  growing  levels  of  “interest” 
and  “attention”.  The  subjects  could  discern  these  intensity  differences  and 
several  modulated  their  own  speech  to  influence  them.  For  example,  in  trials  1 
and  2,  Kismet  responded  to  subject  S3’s  praise  by  perking  its  ears  and  showing 
a  small  grin.  In  the  next  two  trials  the  subject  raised  her  pitch  while  praising 
Kismet  to  coax  a  stronger  response.  In  trials  6-8  Kismet  smiles  broadly.  We 
found  that  subjects  often  use  Kismet’s  expressions  to  regulate  their  affective 
impact  on  the  robot. 

During the course of the interaction, several interesting dynamic social phenomena
arose. Often these occurred in the context of prohibiting the robot. For
instance, several of the subjects reported experiencing a very strong emotional
response  immediately  after  “successfully”  prohibiting  the  robot.  In  these  cases, 
the robot’s saddened face and body posture were enough to arouse a strong sense
of empathy. The subject would often immediately stop and look to the experimenter
with an anguished expression on her face, claiming to feel “terrible” or
“guilty”.  In  this  emotional  feedback  cycle,  the  robot’s  own  affective  response 
to  the  subject’s  vocalizations  evoked  a  strong  and  similar  emotional  response 
in  the  subject  as  well.  This  empathic  response  can  be  considered  to  be  a  form 
of  entrainment. 

Another interesting social dynamic we observed involved affective mirroring
between robot and human. For instance, another female subject (S2)
issued a medium-strength prohibition to the robot, which caused it to dip
its head. She responded by lowering her own head and reiterating the prohibition,
this time in a somewhat more foreboding tone. This caused the robot to dip its
head even further and look more dejected. The cycle continued to increase in
intensity until it bottomed out with both subject and robot having dramatic
body postures and facial expressions that mirrored the other. We see a similar
pattern for subject S3 while issuing attentional bids. During trials 14-16
the  subject  mirrors  the  same  alert  posture  as  the  robot.  This  technique  was 
often  employed  to  modulate  the  degree  to  which  the  strength  of  the  message 
was  “communicated”  to  the  robot.  This  dynamic  between  robot  and  human  is 
further  evidence  of  entrainment. 

4.  Proto-Dialog 

Achievement  of  adult-level  conversation  with  a  robot  is  a  long  term  research 
goal.  This  involves  overcoming  challenges  both  with  respect  to  the  content  of 
the  exchange  as  well  as  to  the  delivery.  The  dynamics  of  turn-taking  in  adult 
conversation are flexible and robust. Humans employ a variety of paralinguistic
social cues, called envelope displays and well studied by discourse theorists, to
regulate the exchange of speaking turns [2]. Given that a robotic implementation
is limited by perceptual, motor, and computational resources, could such
cues  be  useful  to  regulate  the  turn-taking  of  humans  and  robots? 

Kismet’s turn-taking skills are supplemented with envelope displays as
posited by discourse theorists. These paralinguistic social cues (such as raising
of the brows at the end of a turn, or averting gaze at the start of a turn)
are particularly important for Kismet because processing limitations force the
robot to take turns at a slower rate than is typical for human adults. However,
humans  seem  to  intuitively  read  Kismet’s  cues  and  use  them  to  regulate  the 
rate  of  exchange  at  a  pace  where  both  partners  perform  well. 

4.1.  Envelope  Display  Experiment 

To investigate Kismet’s turn-taking performance during proto-dialogs, we invited
three naive subjects to interact with Kismet. Subjects ranged in age from
12 to 28 years old. Both male and female subjects participated. Each
subject was simply asked to carry on a “play” conversation with the robot.
The exchanges were video recorded for later analysis. The subjects were told
that the robot did not speak or understand English, but would babble to them
something like an infant.

Figure 9. The left table shows data illustrating evidence for entrainment of
human to robot. The right table summarizes Kismet’s turn-taking performance
during proto-dialog with three naive subjects. Significant disturbances are
small clusters of pauses and interruptions between Kismet and the subject
until turn-taking becomes coordinated again.

Often  the  subjects  begin  the  session  by  speaking  longer  phrases  and  only 
using  the  robot’s  vocal  behavior  to  gauge  their  speaking  turn.  They  also  expect 
the  robot  to  respond  immediately  after  they  finish  talking.  Within  the  first 
couple  of  exchanges,  they  may  notice  that  the  robot  interrupts  them,  and  they 
begin  to  adapt  to  Kismet’s  rate.  They  start  to  use  shorter  phrases,  wait  longer 
for the robot to respond, and more carefully watch the robot’s turn-taking cues.
The  robot  prompts  the  other  for  their  turn  by  craning  its  neck  forward,  raising 
its  brows,  and  looking  at  the  person’s  face  when  it’s  ready  for  them  to  speak. 
It  will  hold  this  posture  for  a  few  seconds  until  the  person  responds.  Often, 
within  a  second  of  this  display,  the  subject  does  so.  The  robot  then  leans  back 
to a neutral posture, assumes a neutral expression, and tends to shift its gaze
away from the person. This cue indicates that the robot is about to speak. The
robot typically issues one utterance, but it may issue several. Nonetheless, as
the exchange proceeds, the subjects tend to wait until prompted.
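
The cue sequence described above can be read as a small state machine. The sketch below is one possible reading, not Kismet's control code: the phase names, event names, and the `step` and `run` helpers are hypothetical, and the text does not say what the robot does if the person never responds to the prompt, so the `prompt_timeout` transition is an assumption.

```python
from enum import Enum, auto


class Phase(Enum):
    PROMPTING = auto()    # robot offers the turn to the person
    LISTENING = auto()    # person holds the speaking turn
    TAKING_TURN = auto()  # robot signals that it is about to speak
    SPEAKING = auto()     # robot issues one or more utterances


# Envelope display shown on entering each phase, paraphrased from the text.
DISPLAYS = {
    Phase.PROMPTING: ("crane neck forward", "raise brows", "look at face"),
    Phase.LISTENING: (),
    Phase.TAKING_TURN: ("lean back to neutral posture",
                        "neutral expression", "shift gaze away"),
    Phase.SPEAKING: (),
}

# Hypothetical event names. "prompt_timeout" is an assumed fallback for the
# case where the person does not respond within a few seconds.
TRANSITIONS = {
    (Phase.PROMPTING, "human_started"): Phase.LISTENING,
    (Phase.PROMPTING, "prompt_timeout"): Phase.TAKING_TURN,
    (Phase.LISTENING, "human_finished"): Phase.TAKING_TURN,
    (Phase.TAKING_TURN, "display_done"): Phase.SPEAKING,
    (Phase.SPEAKING, "robot_finished"): Phase.PROMPTING,
}


def step(phase, event):
    """Return the next phase; events with no transition leave it unchanged."""
    return TRANSITIONS.get((phase, event), phase)


def run(events, start=Phase.PROMPTING):
    """Replay a list of events and return the sequence of phases visited."""
    phase = start
    trace = [phase]
    for event in events:
        phase = step(phase, event)
        trace.append(phase)
    return trace


if __name__ == "__main__":
    one_exchange = ["human_started", "human_finished",
                    "display_done", "robot_finished"]
    for phase in run(one_exchange):
        print(phase.name, DISPLAYS[phase])
```

Keeping the displays in a table separate from the transitions makes it easy to see which cue the person can watch for at each point in the exchange.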

Before  the  subjects  adapt  their  behavior  to  the  robot’s  capabilities,  the 
robot is more likely to interrupt them. There tend to be more frequent delays
in the flow of “conversation” where the human prompts the robot again for a
response. Often these “hiccups” in the flow appear in short clusters of mutual
interruptions and pauses (often over 2 to 4 speaking turns) before the turns become
coordinated and the flow smooths out. However, analysis of the video
of these human-robot “conversations” provides evidence that people entrain
to the robot (see the left table in Figure 9). These “hiccups” become
less  frequent.  The  human  and  robot  are  able  to  carry  on  longer  sequences  of 
clean  turn  transitions.  At  this  point  the  rate  of  vocal  exchange  is  well  matched 
to  the  robot’s  perceptual  limitations.  The  vocal  exchange  is  reasonably  fluid. 
The  table  to  the  right  in  figure  9  shows  that  the  robot  is  engaged  in  a  smooth 
proto-dialog  with  the  human  partner  the  majority  of  the  time  (about  82%). 
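
One way such video annotations could be summarized is sketched below. It is illustrative only: the label names, the `summarize` function, and the sample sequence are hypothetical and are not data from the study. It also counts turn transitions, whereas the 82% figure above is reported as a share of time, so the two measures are not directly comparable.

```python
from itertools import groupby


def summarize(labels):
    """Summarize a sequence of per-transition labels.

    Each label is "clean" for a smooth turn transition, or any other string
    (for example "interrupt" or "pause") for a disturbance.
    """
    # Collapse the sequence into runs of clean / not-clean transitions.
    groups = [
        (is_clean, sum(1 for _ in run))
        for is_clean, run in groupby(labels, key=lambda label: label == "clean")
    ]
    clean_runs = [length for is_clean, length in groups if is_clean]
    clusters = sum(1 for is_clean, _ in groups if not is_clean)
    total = len(labels)
    return {
        "clean_fraction": sum(clean_runs) / total if total else 0.0,
        "longest_clean_run": max(clean_runs, default=0),
        "disturbance_clusters": clusters,
    }


if __name__ == "__main__":
    # Made-up annotation: early hiccups, then a longer clean stretch.
    session = ["interrupt", "pause", "clean", "clean", "interrupt",
               "clean", "clean", "clean", "clean", "clean"]
    print(summarize(session))
```

Under this reading, entrainment would show up as longer clean runs and fewer disturbance clusters later in a session than earlier.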

5.  Conclusions 

Experimental data from two distinct studies suggest that people do use the
expressive cues of an anthropomorphic robot to improve the quality of interaction
between them. Whether the subjects were communicating an affective
intent to the robot, or engaging it in a play dialog, evidence for using the
robot’s expressive cues to regulate the interaction and to entrain to the robot
was observed. This has the effect of improving the quality of the interaction
as a whole. In the case of communicating affective intent, people used the
robot’s expressive displays to ensure the correct intent was understood to the
appropriate intensity. In the case of proto-conversation, the subjects quickly
used the robot’s cues to regulate when they should exchange turns. As a
result, the interaction becomes smoother over time with fewer interruptions or
awkward  pauses.  These  results  signify  that  for  social  interactions  with  humans, 
expressive  robotic  faces  are  a  benefit  to  both  the  robot  and  to  the  human  who 
interacts  with  it. 